{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright (c) Microsoft Corporation.  \n",
    "Licensed under the MIT License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Abstractive Summarization using UniLM on CNN/DailyMails"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Before you start\n",
    "Set `QUICK_RUN = True` to run the notebook on a small subset of data and a smaller number of steps. If `QUICK_RUN = False`, the notebook takes about 9 hours to run on a VM with 4 16GB NVIDIA V100 GPUs. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "QUICK_RUN = True"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "This notebook demostrates how to fine-tune the [Unified Language Model](https://arxiv.org/abs/1905.03197) (UniLM) for abstractive summarization task. Utility functions and classes in the microsoft/nlp-recipes repo are used to facilitate data preprocessing, model training, model scoring, result postprocessing, and model evaluation.\n",
    "\n",
    "### Abstractive Summarization\n",
    "Abstractive summarization is the task of taking an input text and summarizing its content in a shorter output text. In contrast to extractive summarization, abstractive summarization doesn't take sentences directly from the input text, instead, rephrases the input text.\n",
    "\n",
    "### UniLM\n",
    "UniLM is a state of the art model developed by Microsoft Research Asia (MSRA). The model is first pre-trained on a large unlabeled natural language corpus (English Wikipedia and BookBorpus) and can be fine-tuned on different types of labeled data for various NLP tasks like text classification and abstractive summarization.   \n",
    "The figure below shows the UniLM architecture. During pre-training, the model parameters are shared across the LM objectives (i.e., bidirectional LM, unidirectional LM, and sequence-to-sequence LM). For different NLP tasks, UniLM uses different self-attention masks to control the access to context for each word token.  \n",
    "The seq-to-seq LM in the third row in the figure is used in summarization task. In seq-to-seq LM, word tokens in the input sequence can access all the other tokens in the input sequence, but can not access the word tokens in the output sequence. Word tokens in the output sequence can access all the tokens in the input sequence and the tokens in the output sequence generated before the current position. \n",
    "<img src=\"https://nlpbp.blob.core.windows.net/images/unilm_architecture.PNG\" width=\"600\" height=\"600\">\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2\n",
    "import os\n",
    "import shutil\n",
    "from tempfile import TemporaryDirectory\n",
    "import pprint\n",
    "import scrapbook as sb\n",
    "import sys\n",
    "import time\n",
    "import torch\n",
    "\n",
    "nlp_path = os.path.abspath(\"../../\")\n",
    "if nlp_path not in sys.path:\n",
    "    sys.path.insert(0, nlp_path)\n",
    "\n",
    "from utils_nlp.dataset.cnndm import CNNDMSummarizationDatasetOrg\n",
    "from utils_nlp.models.transformers.abstractive_summarization_seq2seq import S2SAbsSumProcessor, S2SAbstractiveSummarizer\n",
    "from utils_nlp.eval import compute_rouge_python\n",
    "\n",
    "from utils_nlp.models.transformers.datasets import SummarizationDataset\n",
    "from utils_nlp.dataset.cnndm import detokenize\n",
    "\n",
    "start_time = time.time()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "# model parameters\n",
    "MODEL_NAME = \"unilm-base-cased\"\n",
    "MAX_SEQ_LENGTH = 768\n",
    "MAX_SOURCE_SEQ_LENGTH = 640\n",
    "MAX_TARGET_SEQ_LENGTH = 128\n",
    "\n",
    "# use 0 for CPU\n",
    "NUM_GPUS =  torch.cuda.device_count()\n",
    "\n",
    "# fine-tuning parameters\n",
    "TRAIN_PER_GPU_BATCH_SIZE = 1\n",
    "GRADIENT_ACCUMULATION_STEPS = 2\n",
    "LEARNING_RATE = 3e-5\n",
    "\n",
    "TOP_N = -1\n",
    "WARMUP_STEPS = 500\n",
    "MAX_STEPS = 5000\n",
    "BEAM_SIZE = 5\n",
    "if QUICK_RUN:\n",
    "    TOP_N = 100\n",
    "    WARMUP_STEPS = 5\n",
    "    MAX_STEPS = 50\n",
    "    BEAM_SIZE = 3\n",
    "    if NUM_GPUS == 0:\n",
    "        TOP_N = 5\n",
    "        MAX_STEPS = 10\n",
    "\n",
    "# inference parameters\n",
    "TEST_PER_GPU_BATCH_SIZE = 12\n",
    "FORBID_IGNORE_WORD = \".\"\n",
    "\n",
    "# mixed precision setting. To enable mixed precision training, follow instructions in SETUP.md. \n",
    "# You will be able to increase the batch sizes with mixed precision training.\n",
    "FP16 = False\n",
    "\n",
    "DATA_DIR = TemporaryDirectory().name\n",
    "CACHE_DIR = TemporaryDirectory().name\n",
    "MODEL_DIR = \".\"\n",
    "RESULT_DIR = \".\"\n",
    "OUTPUT_FILE = os.path.join(RESULT_DIR, 'nlp_cnndm_finetuning_results.txt')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the CNN/DailyMail dataset\n",
    "The [CNN/DailyMail dataset](https://cs.nyu.edu/~kcho/DMQA/) was original introduced for Q&A research. There are multiple versions of the dataset processed for summarization task available on the web. The `CNNDMSummarizationDatasetOrg` function downloads a version from the [UniLM repo](https://github.com/microsoft/unilm) with minimal processing. The function returns the training and testing dataset as `SummarizationDataset` which can be further processed for model training and testing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "train_ds, test_ds = CNNDMSummarizationDatasetOrg(local_path=DATA_DIR, top_n=TOP_N)\n",
    "print(len(train_ds))\n",
    "print(len(test_ds))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preprocessing\n",
    "The `S2SAbsSumProcessor` has multiple methods for converting input data in `SummarizationDataset`, `IterableSummarizationDataset` or json files into the format required for model training and testing. The preprocessing steps include\n",
    "- Tokenize input text\n",
    "- Convert tokens into token ids"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "processor = S2SAbsSumProcessor(model_name=MODEL_NAME, cache_dir=CACHE_DIR)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cached_features_file_train = os.path.join(RESULT_DIR, \"cached_features_for_training.pt\")\n",
    "cached_features_file_test = os.path.join(RESULT_DIR, \"cached_features_for_testing.pt\")\n",
    "train_dataset = processor.s2s_dataset_from_sum_ds(train_ds, cached_features_file=cached_features_file_train, train_mode=True)\n",
    "test_dataset = processor.s2s_dataset_from_sum_ds(test_ds, cached_features_file=cached_features_file_test, train_mode=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fine tune model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `S2SAbstractiveSummarizer` loads a pre-trained UniLM model specified by `model_name`.  \n",
    "Call `S2SAbstractiveSummarizer.list_supported_models()` to see all the supported models.  \n",
    "If you want to use a model on the local disk, specify `load_model_from_dir` and `model_file_name`. This is particularly useful if you want to load a previously fine-tuned model and use it for inference directly without fine-tuning. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "abs_summarizer = S2SAbstractiveSummarizer(\n",
    "    model_name=MODEL_NAME,\n",
    "    max_seq_length=MAX_SEQ_LENGTH,\n",
    "    max_source_seq_length=MAX_SOURCE_SEQ_LENGTH,\n",
    "    max_target_seq_length=MAX_TARGET_SEQ_LENGTH,\n",
    "    cache_dir=CACHE_DIR\n",
    ")\n",
    "\n",
    "## To load a model on the local disk\n",
    "# abs_summarizer = S2SAbstractiveSummarizer(\n",
    "#     model_name=MODEL_NAME,\n",
    "#     max_seq_len=MAX_SEQ_LEN,\n",
    "#     max_source_seq_length=MAX_SOURCE_SEQ_LENGTH,\n",
    "#     max_target_seq_length=MAX_TARGET_SEQ_LENGTH,\n",
    "#     load_model_from_dir=\"./\",\n",
    "#     model_file_name=\"model.5000.bin\",\n",
    "# )\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "abs_summarizer.fit(\n",
    "    train_dataset=train_dataset,\n",
    "    num_gpus=NUM_GPUS,\n",
    "    per_gpu_batch_size=TRAIN_PER_GPU_BATCH_SIZE,\n",
    "    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,\n",
    "    learning_rate=LEARNING_RATE,\n",
    "    warmup_steps=WARMUP_STEPS,\n",
    "    max_steps=MAX_STEPS,\n",
    "    fp16=FP16,\n",
    "    save_model_to_dir=MODEL_DIR\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate summaries on testing dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "predictions = abs_summarizer.predict(\n",
    "    test_dataset=test_dataset,\n",
    "    num_gpus=NUM_GPUS,\n",
    "    per_gpu_batch_size=TEST_PER_GPU_BATCH_SIZE,\n",
    "    beam_size=BEAM_SIZE,\n",
    "    forbid_ignore_word=FORBID_IGNORE_WORD,\n",
    "    fp16=FP16\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for r in predictions[:TOP_N]:\n",
    "    print(r)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_ds.get_source()[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_ds.get_target()[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictions[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(OUTPUT_FILE, 'w', encoding=\"utf-8\") as f:\n",
    "    for line in predictions:\n",
    "        f.write(line + '\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prediction on a single input sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "source = \"\"\"\n",
    "But under the new rule, set to be announced in the next 48 hours, Border Patrol agents would immediately return anyone to Mexico — without any detainment and without any due process — who attempts to cross the southwestern border between the legal ports of entry. The person would not be held for any length of time in an American facility.\n",
    "\n",
    "Although they advised that details could change before the announcement, administration officials said the measure was needed to avert what they fear could be a systemwide outbreak of the coronavirus inside detention facilities along the border. Such an outbreak could spread quickly through the immigrant population and could infect large numbers of Border Patrol agents, leaving the southwestern border defenses weakened, the officials argued.\n",
    "The Trump administration plans to immediately turn back all asylum seekers and other foreigners attempting to enter the United States from Mexico illegally, saying the nation cannot risk allowing the coronavirus to spread through detention facilities and Border Patrol agents, four administration officials said.\n",
    "The administration officials said the ports of entry would remain open to American citizens, green-card holders and foreigners with proper documentation. Some foreigners would be blocked, including Europeans currently subject to earlier travel restrictions imposed by the administration. The points of entry will also be open to commercial traffic.\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "singel_test_ds = SummarizationDataset(\n",
    "    None, source=[source], source_preprocessing=[detokenize],\n",
    ")\n",
    "single_test_dataset = processor.s2s_dataset_from_sum_ds(singel_test_ds, train_mode=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "single_prediction = abs_summarizer.predict(\n",
    "    test_dataset=single_test_dataset,\n",
    "    num_gpus=NUM_GPUS,\n",
    "    per_gpu_batch_size=1,\n",
    "    beam_size=BEAM_SIZE,\n",
    "    forbid_ignore_word=FORBID_IGNORE_WORD,\n",
    "    fp16=FP16\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "single_prediction[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation\n",
    "We provide utility functions for evaluating summarization models and details can be found in the [summarization evaluation notebook](./summarization_evaluation.ipynb).  \n",
    "For the settings in this notebook with QUICK_RUN=False, you should get ROUGE scores close to the following numbers:  \n",
    "{'rouge-1': {'f': 0.37109626943068647,\n",
    "  'p': 0.4692792272280924,\n",
    "  'r': 0.33322322114381886},  \n",
    " 'rouge-2': {'f': 0.1690495786379728,\n",
    "  'p': 0.21782900161918375,\n",
    "  'r': 0.15079122430118444},  \n",
    " 'rouge-l': {'f': 0.2671310062443078,\n",
    "  'p': 0.3414039392451434,\n",
    "  'r': 0.2392756715930202}}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rouge_scores = compute_rouge_python(cand=predictions, ref=test_ds.get_target())\n",
    "pprint.pprint(rouge_scores)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Distributed training with DistributedDataParallel (DDP)\n",
    "This notebook uses DataParallel for multi-GPU training by default. In general, DistributedDataParallel(DDP) is recommended because of its better performance. See details [here](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).  \n",
    "Since DDP requires multiprocess and can not be run from the notebook, we provide a python script [abstractive_summarization_unilm_cnndm.py](./abstractive_summarization_unilm_cnndm.py) to demonstrate how to use DDP.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we save the training and testing dataset to jsonlines files to be used by the python script. This avoids multiple processes repeating the initial data pre-processing. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_ds.save_to_jsonl(os.path.join(RESULT_DIR, \"train_ds.jsonl\"))\n",
    "test_ds.save_to_jsonl(os.path.join(RESULT_DIR, \"test_ds.jsonl\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we can execute the Python script using torch.distributed.launch, set `--nproc_per_node` to the number of GPUs on your machine.\n",
    "\n",
    "```\n",
    "python -m torch.distributed.launch --nproc_per_node=4 --nnode=1 abstractive_summarization_unilm_cnndm.py\n",
    "```\n",
    "\n",
    "**Note that the python script set `fp16=False` by default. If you have enabled mixed precision training following the instructions in [SETUP.md](\"../../SETUP.md\"), you can call the script with an additional argument \"--fp16 true\". You will be able to increase the batch sizes with mixed precision training.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if os.path.exists(DATA_DIR):\n",
    "    shutil.rmtree(DATA_DIR, ignore_errors=True)\n",
    "if os.path.exists(CACHE_DIR):\n",
    "    shutil.rmtree(CACHE_DIR, ignore_errors=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Total notebook running time {}\".format(time.time() - start_time))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# for testing\n",
    "sb.glue(\"rouge_1_f_score\", rouge_scores[\"rouge-1\"][\"f\"])\n",
    "sb.glue(\"rouge_2_f_score\", rouge_scores[\"rouge-2\"][\"f\"])\n",
    "sb.glue(\"rouge_l_f_score\", rouge_scores[\"rouge-l\"][\"f\"])"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python (nlp_gpu)",
   "language": "python",
   "name": "nlp_gpu"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
